@olegmukhin commented Jul 19, 2025

Added a new LookUp filter to address the use case where a record needs to be enriched via a simple static key/value lookup.

The filter loads a CSV file into a hash table for performance. It considers the first column of the CSV to be the key and the second column to be the value; all other columns are ignored.

When a record value (identified by the lookup_key option) matches a key from the CSV, the value from that CSV row is added to the record under a new key (defined by the result_key option).
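
For example, with the CSV row server-prod-001,Finance from the test data below, lookup_key $hostname, and result_key business_line:

Input:  {"hostname": "server-prod-001"}
Output: {"hostname": "server-prod-001", "business_line": "Finance"}

Records whose hostname does not appear in the CSV pass through unchanged.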


Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • New Features

    • Added a "lookup" filter to enrich records from a CSV key/value map; supports configurable source/result fields and optional case-insensitive matching. Robust CSV parsing and preserves originals on no-match; exposes processed/matched/skipped metrics.
  • Tests

    • Added comprehensive runtime tests covering CSV parsing, matching, large datasets, nested/array keys, metrics; tests run conditionally when record accessor support is available.
  • Chores

    • Added build/config entries to enable the filter (ON by default; can be disabled in minimal builds).

@olegmukhin (Author)

Test configuration

Fluent Bit YAML Configuration

parsers:
  - name: json
    format: json

pipeline:
  inputs:
    - name: tail
      path: /src/devices.log
      read_from_head: true
      parser: json

  filters:
    - name: lookup
      match: "*"
      file: /src/device-bu.csv
      lookup_key: $hostname
      result_key: business_line
      ignore_case: true

  outputs:
    - name: stdout
      match: "*"

To test the new filter we load a range of log values, including strings (in different cases), integers, booleans, values with embedded quotes, and other types.

devices.log

{"hostname": "server-prod-001"}
{"hostname": "Server-Prod-001"}
{"hostname": "db-test-abc"}
{"hostname": 123}
{"hostname": true}
{"hostname": " host with space "}
{"hostname": "quoted \"host\""}
{"hostname": "unknown-host"}
{}
{"hostname": [1,2,3]}
{"hostname": {"sub": "val"}}
{"hostname": " "}

The CSV file tests key overwrites, different types of strings, and the use and escaping of quotes.

device-bu.csv

hostname,business_line
server-prod-001,Finance
db-test-abc,Engineering
db-test-abc,Marketing
web-frontend-xyz,Marketing
app-backend-123,Operations
"legacy-system true","Legacy IT"
" host with space ","Infrastructure"
"quoted ""host""", "R&D"
123, "R&D"
true, "R&D"
no-match-host,Should Not Appear

When executed with the verbose flag, the following output is produced.

Test output

[2025/07/19 14:38:48] [ info] Configuration:
[2025/07/19 14:38:48] [ info]  flush time     | 1.000000 seconds
[2025/07/19 14:38:48] [ info]  grace          | 5 seconds
[2025/07/19 14:38:48] [ info]  daemon         | 0
[2025/07/19 14:38:48] [ info] ___________
[2025/07/19 14:38:48] [ info]  inputs:
[2025/07/19 14:38:48] [ info]      tail
[2025/07/19 14:38:48] [ info] ___________
[2025/07/19 14:38:48] [ info]  filters:
[2025/07/19 14:38:48] [ info]      lookup.0
[2025/07/19 14:38:48] [ info] ___________
[2025/07/19 14:38:48] [ info]  outputs:
[2025/07/19 14:38:48] [ info]      stdout.0
[2025/07/19 14:38:48] [ info] ___________
[2025/07/19 14:38:48] [ info]  collectors:
[2025/07/19 14:38:48] [ info] [fluent bit] version=4.1.0, commit=, pid=50224
[2025/07/19 14:38:48] [debug] [engine] coroutine stack size: 196608 bytes (192.0K)
[2025/07/19 14:38:48] [ info] [storage] ver=1.5.3, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2025/07/19 14:38:48] [ info] [simd    ] disabled
[2025/07/19 14:38:48] [ info] [cmetrics] version=1.0.4
[2025/07/19 14:38:48] [ info] [ctraces ] version=0.6.6
[2025/07/19 14:38:48] [ info] [input:tail:tail.0] initializing
[2025/07/19 14:38:48] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
[2025/07/19 14:38:48] [debug] [tail:tail.0] created event channels: read=25 write=26
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] inotify watch fd=31
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] scanning path /src/*.log
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] file will be read in POSIX_FADV_DONTNEED mode /src/devices.log
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] inode=10 with offset=0 appended as /src/devices.log
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] scan_glob add(): /src/devices.log, inode 10
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] 1 new files found on path '/src/*.log'
[2025/07/19 14:38:48] [ info] [filter:lookup:lookup.0] Loaded 10 entries from CSV
[2025/07/19 14:38:48] [debug] [stdout:stdout.0] created event channels: read=33 write=34
[2025/07/19 14:38:48] [ info] [output:stdout:stdout.0] worker #0 started
[2025/07/19 14:38:48] [ info] [sp] stream processor started
[2025/07/19 14:38:48] [ info] [engine] Shutdown Grace Period=5, Shutdown Input Grace Period=2
[2025/07/19 14:38:48] [debug] [filter:lookup:lookup.0] Record 4: lookup value for key '$hostname' is non-string, converted to '123'
[2025/07/19 14:38:48] [debug] [filter:lookup:lookup.0] Record 5: lookup value for key '$hostname' is non-string, converted to 'true'
[2025/07/19 14:38:48] [debug] [filter:lookup:lookup.0] Record 10: lookup_key '$hostname' has type array/map, skipping to avoid ra error
[2025/07/19 14:38:48] [debug] [filter:lookup:lookup.0] Record 11: lookup_key '$hostname' has type array/map, skipping to avoid ra error
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] [static files] processed 278b
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] inode=10 file=/src/devices.log promote to TAIL_EVENT
[2025/07/19 14:38:48] [ info] [input:tail:tail.0] inotify_fs_add(): inode=10 watch_fd=1 name=/src/devices.log
[2025/07/19 14:38:48] [debug] [input:tail:tail.0] [static files] processed 0b, done
[2025/07/19 14:38:49] [debug] [task] created task=0xffff9c043a10 id=0 OK
[2025/07/19 14:38:49] [debug] [output:stdout:stdout.0] task_id=0 assigned to thread #0
[0] tail.0: [[1752935928.516352587, {}], {"hostname"=>"server-prod-001", "business_line"=>"Finance"}]
[1] tail.0: [[1752935928.516443337, {}], {"hostname"=>"Server-Prod-001", "business_line"=>"Finance"}]
[2] tail.0: [[1752935928.516445712, {}], {"hostname"=>"db-test-abc", "business_line"=>"Marketing"}]
[3] tail.0: [[1752935928.516448504, {}], {"hostname"=>123, "business_line"=>"R&D"}]
[4] tail.0: [[1752935928.516450337, {}], {"hostname"=>true, "business_line"=>"R&D"}]
[5] tail.0: [[1752935928.516452004, {}], {"hostname"=>" host with space ", "business_line"=>"Infrastructure"}]
[6] tail.0: [[1752935928.516453670, {}], {"hostname"=>"quoted "host"", "business_line"=>"R&D"}]
[7] tail.0: [[1752935928.516455212, {}], {"hostname"=>"unknown-host"}]
[8] tail.0: [[1752935928.516456504, {}], {}]
[9] tail.0: [[1752935928.516458712, {}], {"hostname"=>[1, 2, 3]}]
[10] tail.0: [[1752935928.516460754, {}], {"hostname"=>{"sub"=>"val"}}]
[2025/07/19 14:38:49] [debug] [out flush] cb_destroy coro_id=0
[2025/07/19 14:38:49] [debug] [task] destroy task=0xffff9c043a10 (task_id=0)

The output shows correct matching, correct handling of different value types, and pass-through of records when no match is found.

Valgrind summary (after a run with multiple types of lookups):

==50220== HEAP SUMMARY:
==50220==     in use at exit: 0 bytes in 0 blocks
==50220==   total heap usage: 14,547 allocs, 14,550 frees, 74,987,419 bytes allocated
==50220== 
==50220== All heap blocks were freed -- no leaks are possible
==50220== 
==50220== Use --track-origins=yes to see where uninitialised values come from
==50220== For lists of detected and suppressed errors, rerun with: -s
==50220== ERROR SUMMARY: 6 errors from 4 contexts (suppressed: 0 from 0)

@olegmukhin (Author)

Documentation for this filter has been submitted as fluent/fluent-bit-docs#1953.

@olegmukhin (Author)

Added unit tests for lookup filter. All tests pass:

Test basic_lookup...                            [ OK ]
Test ignore_case...                             [ OK ]
Test csv_quotes...                              [ OK ]
Test numeric_values...                          [ OK ]
Test large_numbers...                           [ OK ]
Test boolean_values...                          [ OK ]
Test no_match...                                [ OK ]
Test long_csv_lines...                          [ OK ]
Test whitespace_trim...                         [ OK ]
Test dynamic_buffer...                          [ OK ]
Test nested_keys...                             [ OK ]
Test large_csv...                               [ OK ]
Test nested_array_keys...                       [ OK ]
Test metrics_matched...                         [ OK ]
Test metrics_processed...                       [ OK ]

Valgrind results show appropriate memory management.

==19111== HEAP SUMMARY:
==19111==     in use at exit: 0 bytes in 0 blocks
==19111==   total heap usage: 6,964 allocs, 6,964 frees, 59,096,180 bytes allocated
==19111== 
==19111== All heap blocks were freed -- no leaks are possible
==19111== 
==19111== Use --track-origins=yes to see where uninitialised values come from

@olegmukhin (Author)

Added a fix for the failing checks on CentOS 7 and Windows. Please rerun.

@olegmukhin (Author) commented Jul 21, 2025

The last check is failing due to a CentOS 7 incompatibility in the unit test file - fix in the last commit. Please rerun.

@olegmukhin (Author)

Could this please get one more run at the checks? I didn't realise we need this to compile on CentOS 7 - should be good now with the last commit.

@patrick-stephens (Contributor)

Can you rebase and push so it reruns tests?

@coderabbitai bot commented Sep 12, 2025

Walkthrough

Adds a new CSV-backed "lookup" filter plugin with build option and CMake wiring, implements plugin source and header (CSV parsing, hash table lookup, record accessor, metrics), and introduces comprehensive runtime tests gated on record accessor support.

Changes

  • Build option (cmake/plugins_options.cmake): Adds the FLB_FILTER_LOOKUP option (Filters section, default ON, defined via DEFINE_OPTION).
  • Plugin registration (plugins/CMakeLists.txt): Registers filter_lookup in the plugin list when FLB_RECORD_ACCESSOR is enabled.
  • Plugin build (plugins/filter_lookup/CMakeLists.txt): Adds the CMake target for filter_lookup; skips the build with a status message when flb_record_accessor is disabled.
  • Plugin implementation (plugins/filter_lookup/lookup.c, plugins/filter_lookup/lookup.h): New lookup filter plugin: CSV loader (quoted/escaped fields, trimming, dynamic buffers), hash table storage, record accessor usage (supports nested/array keys), case-insensitive option, metrics counters, lifecycle callbacks (init/filter/exit), resource management, and exported filter_lookup_plugin + struct lookup_ctx.
  • Runtime tests (tests/runtime/CMakeLists.txt, tests/runtime/filter_lookup.c): Adds a runtime test target gated by FLB_RECORD_ACCESSOR and a comprehensive test suite covering CSV parsing, quoting, numeric/boolean values, long lines, whitespace trimming, nested keys/arrays, large CSV load, a dynamic buffer unit test, and metrics assertions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant In as Input
  participant FL as filter_lookup
  participant CSV as CSV Loader
  participant HT as Hash Table
  participant Out as Output

  Note over FL: Initialization (cb_lookup_init)
  FL->>CSV: open CSV file & parse entries
  CSV->>HT: insert normalized key→value
  CSV-->>FL: return count

  loop per record/batch (cb_lookup_filter)
    In->>FL: encoded record(s)
    FL->>FL: extract lookup_key (record accessor)
    alt key present & HT hit
      FL->>HT: lookup(key)
      HT-->>FL: value
      FL->>FL: construct modified record (add result_key)
      FL-->>Out: emit modified record
    else miss or error
      FL-->>Out: emit original record
    end
  end

  Note over FL: Exit (cb_lookup_exit) — free RA, HT, stored values, metrics
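
For readers who prefer code to diagrams, here is a minimal standalone C sketch of the same load-then-lookup flow. It is illustrative only: a linear array scan stands in for flb_hash_table, the two entries are hard-coded from the test CSV instead of being parsed from a file, and records are plain strings rather than msgpack.

#include <stdio.h>
#include <string.h>

/* Toy table standing in for the CSV-loaded flb_hash_table; duplicate
 * CSV keys overwrite, so db-test-abc ends up mapped to Marketing. */
struct entry { const char *key; const char *val; };
static const struct entry table[] = {
    { "server-prod-001", "Finance" },
    { "db-test-abc",     "Marketing" },
};

/* Linear scan in place of flb_hash_table_get() */
static const char *lookup(const char *key)
{
    size_t i;
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(table[i].key, key) == 0) {
            return table[i].val;
        }
    }
    return NULL;
}

int main(void)
{
    const char *hosts[] = { "server-prod-001", "unknown-host" };
    size_t i;

    for (i = 0; i < sizeof(hosts) / sizeof(hosts[0]); i++) {
        const char *val = lookup(hosts[i]);
        if (val != NULL) {
            /* hit: emit record enriched with result_key */
            printf("{\"hostname\": \"%s\", \"business_line\": \"%s\"}\n",
                   hosts[i], val);
        }
        else {
            /* miss: emit the original record unchanged */
            printf("{\"hostname\": \"%s\"}\n", hosts[i]);
        }
    }
    return 0;
}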

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

A nibble of CSV beneath my paw,
I map each key without a flaw.
When logs arrive I sniff and find,
the matching value, neatly lined.
Hop, add a field — enrichment done! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 68.57%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title "filter_lookup: added filter for key value lookup" succinctly and accurately describes the primary change: adding a new lookup filter that enriches records via key/value lookups. It is concise, names the plugin, and provides enough context for a reviewer scanning history to understand the main intent. The title avoids unnecessary detail and aligns with the PR objectives and modified files.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fef9e5 and 0673bcd.

📒 Files selected for processing (7)
  • cmake/plugins_options.cmake (1 hunks)
  • plugins/CMakeLists.txt (1 hunks)
  • plugins/filter_lookup/CMakeLists.txt (1 hunks)
  • plugins/filter_lookup/lookup.c (1 hunks)
  • plugins/filter_lookup/lookup.h (1 hunks)
  • tests/runtime/CMakeLists.txt (1 hunks)
  • tests/runtime/filter_lookup.c (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/runtime/CMakeLists.txt
  • plugins/filter_lookup/CMakeLists.txt
  • plugins/CMakeLists.txt
  • plugins/filter_lookup/lookup.h
  • cmake/plugins_options.cmake
  • tests/runtime/filter_lookup.c
🧰 Additional context used
🧬 Code graph analysis (1)
plugins/filter_lookup/lookup.c (10)
tests/runtime/filter_lookup.c (3)
  • dynbuf_init (682-689)
  • dynbuf_append_char (691-702)
  • dynbuf_destroy (704-711)
include/fluent-bit/flb_mem.h (2)
  • flb_free (126-128)
  • flb_calloc (84-96)
src/flb_hash_table.c (4)
  • flb_hash_table_add (401-494)
  • flb_hash_table_create (99-137)
  • flb_hash_table_destroy (197-215)
  • flb_hash_table_get (496-522)
lib/cmetrics/src/cmt_counter.c (1)
  • cmt_counter_create (26-81)
src/flb_metrics.c (1)
  • flb_metrics_add (129-178)
include/fluent-bit/flb_filter.h (1)
  • flb_filter_config_map_set (125-129)
src/flb_record_accessor.c (3)
  • flb_ra_create (271-358)
  • flb_ra_destroy (232-248)
  • flb_ra_get_value_object (803-814)
src/flb_log_event_encoder.c (7)
  • flb_log_event_encoder_begin_record (246-254)
  • flb_log_event_encoder_set_timestamp (276-287)
  • flb_log_event_encoder_commit_record (256-274)
  • flb_log_event_encoder_rollback_record (241-244)
  • flb_log_event_encoder_init (42-74)
  • flb_log_event_encoder_claim_internal_buffer_ownership (118-124)
  • flb_log_event_encoder_destroy (99-116)
src/flb_log_event_decoder.c (3)
  • flb_log_event_decoder_init (99-116)
  • flb_log_event_decoder_destroy (147-179)
  • flb_log_event_decoder_next (310-406)
src/flb_ra_key.c (1)
  • flb_ra_key_value_destroy (842-851)
🔇 Additional comments (21)
plugins/filter_lookup/lookup.c (21)

1-42: LGTM! Standard headers and includes.

The header structure and includes look appropriate for a Fluent Bit filter plugin. Good coverage of necessary system headers and platform-specific handling for Windows.


45-77: LGTM! Well-designed metric macros with proper conditional compilation.

The macro design properly handles both the new cmetrics system and legacy metrics, with clean fallback when metrics are disabled. The cmetrics integration correctly uses timestamps and labels.


80-138: Well-implemented string normalization with proper memory management.

The normalize_and_trim function correctly handles edge cases (empty input, whitespace trimming) and uses clear return value conventions (0=no allocation, 1=allocated, -1=error). Memory allocation is handled safely.
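
As a standalone sketch of those return conventions (written for illustration; this is not the plugin's code, and the exact signature is assumed):

#include <ctype.h>
#include <stdlib.h>

/* Returns 0 when *out aliases the input (nothing to free), 1 when *out
 * was heap-allocated (caller frees), -1 on allocation failure. */
static int normalize_and_trim(const char *in, size_t len, int ignore_case,
                              const char **out, size_t *out_len)
{
    size_t start = 0;
    size_t end = len;
    size_t i;
    char *buf;

    while (start < end && isspace((unsigned char) in[start])) {
        start++;
    }
    while (end > start && isspace((unsigned char) in[end - 1])) {
        end--;
    }

    if (!ignore_case) {
        /* trimming alone needs no copy: point into the input */
        *out = in + start;
        *out_len = end - start;
        return 0;
    }

    buf = malloc(end - start + 1);
    if (buf == NULL) {
        return -1;
    }
    for (i = 0; i < end - start; i++) {
        buf[i] = (char) tolower((unsigned char) in[start + i]);
    }
    buf[end - start] = '\0';
    *out = buf;
    *out_len = end - start;
    return 1;
}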


140-190: LGTM! Dynamic buffer implementation follows best practices.

The dynamic buffer implementation uses proper exponential growth (capacity * 2), handles reallocation failures safely, and maintains null termination. The interface is clean and consistent.
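
A minimal sketch of that growth strategy, assuming the shape described above (the plugin's own dynbuf_* helpers in lookup.c may differ in detail):

#include <stdlib.h>

struct dynbuf {
    char  *data;
    size_t len;
    size_t cap;
};

static int dynbuf_append_char(struct dynbuf *b, char c)
{
    char *p;
    size_t new_cap;

    if (b->len + 1 >= b->cap) {
        new_cap = b->cap > 0 ? b->cap * 2 : 64;   /* exponential growth */
        p = realloc(b->data, new_cap);
        if (p == NULL) {
            return -1;          /* original buffer remains valid */
        }
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = c;
    b->data[b->len] = '\0';     /* keep the buffer NUL-terminated */
    return 0;
}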


192-260: Robust dynamic line reading implementation.

The function properly handles arbitrary line lengths, memory reallocation failures, and cross-platform line endings (\r\n). Good defensive programming with proper cleanup on allocation failures.


262-306: LGTM! CSV loading setup with proper error handling.

File opening, header skipping, and initialization of the value tracking list are all handled correctly with appropriate error messages and cleanup.


309-381: LGTM! Comprehensive CSV key parsing with quote handling.

The key parsing correctly handles quoted fields, escaped quotes (doubled quotes), and dynamic buffer growth. The logic for quote state management and field separation is sound.
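
For reference, a self-contained sketch of the doubled-quote convention exercised here (not the plugin's parser; read_quoted_field is a name invented for this illustration). Inside a quoted field, "" decodes to a literal quote, so the test CSV key "quoted ""host""" decodes to: quoted "host".

#include <stddef.h>

/* Decode one quoted CSV field starting at the opening quote. Writes the
 * decoded bytes to out (caller-sized) and returns the position just past
 * the closing quote, or NULL on an unmatched quote. */
static const char *read_quoted_field(const char *p, char *out)
{
    p++;                            /* skip the opening quote */
    while (*p != '\0') {
        if (*p == '"') {
            if (p[1] == '"') {      /* "" decodes to a literal " */
                *out++ = '"';
                p += 2;
                continue;
            }
            *out = '\0';
            return p + 1;           /* field complete */
        }
        *out++ = *p++;
    }
    return NULL;                    /* unmatched quote: error */
}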


393-455: LGTM! Value parsing mirrors key parsing with consistent quote handling.

The value parsing uses the same robust quote handling as the key parsing, including proper handling of escaped quotes and unmatched quote detection.


457-487: LGTM! Proper normalization and validation of parsed data.

The code correctly normalizes keys (with ignore_case) and values, handles allocation failures, and validates that neither key nor value is empty before proceeding.


554-677: LGTM! Comprehensive plugin initialization with proper validation.

The initialization function properly:

  • Sets up both cmetrics and legacy metrics
  • Validates required configuration parameters
  • Checks file accessibility with platform-specific functions
  • Creates hash table and record accessor
  • Loads CSV data and reports success

Error handling includes proper cleanup of allocated resources.


679-706: LGTM! Clean helper function for emitting unchanged records.

The emit_original_record function properly handles the log event encoding pipeline with appropriate error handling and metric updates.


708-755: LGTM! Robust filter initialization with proper validation.

The main filter function correctly initializes decoders/encoders, validates context, and handles initialization failures with appropriate cleanup and fallback to NOTOUCH.


756-796: LGTM! Proper record processing setup with good validation.

The record processing loop correctly increments processed metrics, validates record structure (must be a map), and uses the record accessor to extract lookup values. Non-map records are handled gracefully.


798-807: LGTM! String value extraction with proper normalization.

String values from the record accessor are correctly normalized using the same function as CSV loading, maintaining consistency in case handling and whitespace trimming.


850-917: LGTM! Robust memory management for dynamic value conversion.

The code properly allocates buffers based on calculated size, performs the actual formatting, validates output, and handles the complex memory ownership between the dynamic buffer and normalized string. The delayed cleanup logic is sound.


928-951: LGTM! Hash table lookup with proper match handling.

The hash table lookup correctly handles not-found cases, increments match metrics on success, and includes good trace logging for debugging. Cleanup is performed before proceeding to record encoding.


953-1003: LGTM! Comprehensive record encoding with proper error handling.

The record encoding process correctly:

  • Begins new record and sets timestamp/metadata
  • Copies all original key-value pairs except conflicting result_key
  • Handles encoding failures with rollback and fallback to original record
  • Maintains proper error metrics throughout

1005-1032: LGTM! Final record assembly with result key addition.

The code correctly adds the result_key and found value to the record, with proper error handling and rollback on failure. The commit process is handled safely.


1040-1052: LGTM! Optimal buffer management strategy.

The function correctly returns FLB_FILTER_MODIFIED only when records were actually changed, avoiding unnecessary buffer copies when no lookups succeeded. Proper cleanup of encoders/decoders is performed.


1054-1073: LGTM! Complete cleanup in exit function.

The exit function properly frees all tracked values in the val_list, destroys the record accessor and hash table, and frees the context. The cleanup is comprehensive and safe.


1075-1091: LGTM! Well-documented configuration map and plugin registration.

The configuration map properly defines all required and optional parameters with appropriate types, descriptions, and struct offsets. The plugin registration structure is complete and correct.


@coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (10)
plugins/filter_lookup/lookup.h (1)

28-31: Use an enum for metric IDs and confirm legacy metrics exposure

Minor: prefer an enum to group these IDs and avoid macro leakage. Also, since the runtime tests fetch metrics via flb_metrics_get_id(), please confirm the plugin populates the legacy metrics list (f_ins->metrics) in addition to cmetrics; otherwise tests on builds without FLB_HAVE_METRICS will fail.

Apply:

-#define FLB_LOOKUP_METRIC_PROCESSED     200
-#define FLB_LOOKUP_METRIC_MATCHED       201
-#define FLB_LOOKUP_METRIC_SKIPPED       202
+enum {
+    FLB_LOOKUP_METRIC_PROCESSED = 200,
+    FLB_LOOKUP_METRIC_MATCHED   = 201,
+    FLB_LOOKUP_METRIC_SKIPPED   = 202
+};
tests/runtime/filter_lookup.c (2)

125-145: Don’t assume null-terminated output; respect size

The lib output callback receives a buffer and length. Using strstr() on an assumed NUL-terminated buffer is brittle. Copy to a NUL-terminated scratch buffer first.

 static int cb_check_result_json(void *record, size_t size, void *data)
 {
-    char *p;
-    char *expected;
-    char *result;
+    char *p;
+    char *expected;
+    char *result;
 
     expected = (char *) data;
-    result = (char *) record;
+    result = flb_malloc(size + 1);
+    if (!result) {
+        flb_free(record);
+        return -1;
+    }
+    memcpy(result, record, size);
+    result[size] = '\0';
 
     p = strstr(result, expected);
     TEST_CHECK(p != NULL);
 
     if (p == NULL) {
         flb_error("Expected to find: '%s' in result '%s'",
                   expected, result);
     }
 
-    flb_free(record);
+    flb_free(result);
+    flb_free(record);
     return 0;
 }

449-497: Strengthen “no match” test: assert result_key is absent

Currently this test only checks that another field exists. Add a negative assertion to ensure the filter didn’t inject result_key on misses.

+/* Callback to assert a substring is NOT present */
+static int cb_assert_not_contains(void *record, size_t size, void *data)
+{
+    char *needle = (char *) data;
+    char *buf = flb_malloc(size + 1);
+    if (!buf) { flb_free(record); return -1; }
+    memcpy(buf, record, size);
+    buf[size] = '\0';
+    TEST_CHECK(strstr(buf, needle) == NULL);
+    if (strstr(buf, needle) != NULL) {
+        flb_error("Unexpected substring '%s' found in: '%s'", needle, buf);
+    }
+    flb_free(buf);
+    flb_free(record);
+    return 0;
+}
@@
-    /* Should NOT contain the result_key since no match was found */
-    cb_data.cb = cb_check_result_json;
-    cb_data.data = "\"other_field\":\"test\"";
+    /* Should NOT contain result_key since no match was found */
+    cb_data.cb = cb_assert_not_contains;
+    cb_data.data = "\"user_name\":";
plugins/filter_lookup/lookup.c (7)

766-768: Float formatting may not match CSV keys; use compact representation.

%f prints 6 decimals (e.g., 1.0 becomes 1.000000). Prefer %.15g to preserve intent and improve match rates.

-                case FLB_RA_FLOAT:
-                    required_size = snprintf(NULL, 0, "%f", rval->o.via.f64);
+                case FLB_RA_FLOAT:
+                    required_size = snprintf(NULL, 0, "%.15g", rval->o.via.f64);
                     break;
...
-                case FLB_RA_FLOAT:
-                    printed = snprintf(dynamic_val_buf, required_size + 1, "%f", rval->o.via.f64);
+                case FLB_RA_FLOAT:
+                    printed = snprintf(dynamic_val_buf, required_size + 1, "%.15g", rval->o.via.f64);
                     break;

Also applies to: 815-817


46-74: Metrics macros: avoid static mutable storage and guard counter pointers.

static char* labels_array[1]; is shared across threads. Also, cmt_counter_create() can fail; guard ctx->cmt_* in the macros.

-#define INCREMENT_SKIPPED_METRIC(ctx, ins) do { \
-    uint64_t ts = cfl_time_now(); \
-    static char* labels_array[1]; \
-    labels_array[0] = (char*)flb_filter_name(ins); \
-    cmt_counter_add(ctx->cmt_skipped, ts, 1, 1, labels_array); \
-    flb_metrics_sum(FLB_LOOKUP_METRIC_SKIPPED, 1, ins->metrics); \
-} while(0)
+#define INCREMENT_SKIPPED_METRIC(ctx, ins) do { \
+    uint64_t ts = cfl_time_now(); \
+    char* labels_array[1]; \
+    labels_array[0] = (char *) flb_filter_name(ins); \
+    if ((ctx)->cmt_skipped) { cmt_counter_add((ctx)->cmt_skipped, ts, 1, 1, labels_array); } \
+    if ((ins)->metrics) { flb_metrics_sum(FLB_LOOKUP_METRIC_SKIPPED, 1, (ins)->metrics); } \
+} while (0)

Apply the same pattern to INCREMENT_MATCHED_METRIC and INCREMENT_PROCESSED_METRIC.


529-547: Defensive checks on metric creation.

If any cmt_counter_create() returns NULL, later increments will segfault without guards. Consider logging and continuing without those counters.

-        ctx->cmt_processed = cmt_counter_create(ins->cmt,
+        ctx->cmt_processed = cmt_counter_create(ins->cmt,
                                                 "fluentbit", "filter", "lookup_processed_records_total",
                                                 "Total number of processed records",
                                                 1, labels_name);
+        if (!ctx->cmt_processed) { flb_plg_warn(ins, "failed to create processed counter"); }
...
-        ctx->cmt_matched = cmt_counter_create(ins->cmt,
+        ctx->cmt_matched = cmt_counter_create(ins->cmt,
                                               "fluentbit", "filter", "lookup_matched_records_total",
                                               "Total number of matched records",
                                               1, labels_name);
+        if (!ctx->cmt_matched) { flb_plg_warn(ins, "failed to create matched counter"); }
...
-        ctx->cmt_skipped = cmt_counter_create(ins->cmt,
+        ctx->cmt_skipped = cmt_counter_create(ins->cmt,
                                               "fluentbit", "filter", "lookup_skipped_records_total",
                                               "Total number of skipped records due to errors",
                                               1, labels_name);
+        if (!ctx->cmt_skipped) { flb_plg_warn(ins, "failed to create skipped counter"); }

261-269: Always skipping the first CSV line assumes a header. Make this configurable.

Not all CSVs include a header; the first record would be dropped. Add a skip_header (bool) option or auto-detect, and document the behavior.

I can draft the config plumb-through and tests if you want it in this PR.


606-609: Avoid relying on internal ht->total_count.

Accessing struct internals risks future ABI churn. Track a local “loaded entries” count in load_csv() and log that instead.


919-938: Precompute result_key length and use memcmp.

Minor perf/readability tweak in the hot path.

-        if (log_event.body && log_event.body->type == MSGPACK_OBJECT_MAP) {
+        if (log_event.body && log_event.body->type == MSGPACK_OBJECT_MAP) {
             int i;
+            size_t rkey_len = strlen(ctx->result_key);
             for (i = 0; i < log_event.body->via.map.size; i++) {
                 msgpack_object_kv *kv = &log_event.body->via.map.ptr[i];
                 if (kv->key.type == MSGPACK_OBJECT_STR &&
-                    kv->key.via.str.size == strlen(ctx->result_key) &&
-                    strncmp(kv->key.via.str.ptr, ctx->result_key, kv->key.via.str.size) == 0) {
+                    kv->key.via.str.size == rkey_len &&
+                    memcmp(kv->key.via.str.ptr, ctx->result_key, rkey_len) == 0) {
                     continue;
                 }

183-244: CSV support limitations (multiline fields).

Reader splits on newline before CSV parsing; embedded newlines within quoted fields aren’t supported. If out-of-scope, document explicitly.

Also applies to: 291-417

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7798a84 and 45c53a1.

📒 Files selected for processing (7)
  • cmake/plugins_options.cmake (1 hunks)
  • plugins/CMakeLists.txt (1 hunks)
  • plugins/filter_lookup/CMakeLists.txt (1 hunks)
  • plugins/filter_lookup/lookup.c (1 hunks)
  • plugins/filter_lookup/lookup.h (1 hunks)
  • tests/runtime/CMakeLists.txt (1 hunks)
  • tests/runtime/filter_lookup.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
plugins/filter_lookup/lookup.c (7)
include/fluent-bit/flb_mem.h (2)
  • flb_free (126-128)
  • flb_calloc (84-96)
src/flb_hash_table.c (4)
  • flb_hash_table_add (401-494)
  • flb_hash_table_create (99-137)
  • flb_hash_table_destroy (197-215)
  • flb_hash_table_get (496-522)
include/fluent-bit/flb_filter.h (1)
  • flb_filter_config_map_set (125-129)
src/flb_record_accessor.c (3)
  • flb_ra_create (271-358)
  • flb_ra_destroy (232-248)
  • flb_ra_get_value_object (803-814)
src/flb_log_event_encoder.c (7)
  • flb_log_event_encoder_begin_record (246-254)
  • flb_log_event_encoder_set_timestamp (276-287)
  • flb_log_event_encoder_commit_record (256-274)
  • flb_log_event_encoder_rollback_record (241-244)
  • flb_log_event_encoder_init (42-74)
  • flb_log_event_encoder_claim_internal_buffer_ownership (118-124)
  • flb_log_event_encoder_destroy (99-116)
src/flb_log_event_decoder.c (3)
  • flb_log_event_decoder_init (99-116)
  • flb_log_event_decoder_destroy (147-179)
  • flb_log_event_decoder_next (310-406)
src/flb_ra_key.c (1)
  • flb_ra_key_value_destroy (842-851)
tests/runtime/filter_lookup.c (4)
src/flb_lib.c (11)
  • flb_create (138-220)
  • flb_input (261-271)
  • flb_input_set (300-330)
  • flb_filter (287-297)
  • flb_output (274-284)
  • flb_output_set (515-546)
  • flb_stop (942-985)
  • flb_destroy (223-258)
  • flb_filter_set (613-644)
  • flb_start (914-925)
  • flb_lib_push (774-801)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
plugins/filter_lookup/lookup.c (3)
  • dynbuf_init (141-151)
  • dynbuf_append_char (154-169)
  • dynbuf_destroy (172-180)
src/flb_metrics.c (1)
  • flb_metrics_get_id (62-75)
🔇 Additional comments (3)
cmake/plugins_options.cmake (1)

89-89: New lookup filter option — LGTM

Option is correctly added and follows the FLB_MINIMAL override behavior.

plugins/filter_lookup/lookup.c (2)

1-1028: Overall: solid, careful implementation.

CSV parsing with proper escaping, case-insensitive matching via normalized keys, RA integration, encoder/decoder usage, and exhaustive cleanup paths are well-structured. Nice work.


990-1007: Ownership verified — hashtable copies/frees its own buffer when val_size > 0 (no double-free).

entry_set_value() (flb_hash_table.c) allocates and copies the supplied buffer when val_size > 0 and flb_hash_table_entry_free()/flb_hash_table_destroy() free that internal copy; the plugin allocates val_heap, stores it in ctx->val_list and frees it in cb_lookup_exit — these are distinct allocations, so the current cleanup is correct.

New filter aims to address the use case of simple data enrichment using
a static key/value lookup.

The filter loads the first two columns of a CSV file into memory as a
hash table. When a specified record value matches a key in the hash
table, the value is appended to the record (under the key name defined
in the filter options).

Tested with valgrind.

Signed-off-by: Oleg Mukhin <[email protected]>
- Removed unnecessary FLB_FILTER_LOOKUP build flag; LookUp is now
enabled by default like other filters (without a flag).
- Fixed critical use-after-free bug in numeric value lookups.
- Added processed_records_total, matched_records_total and
skipped_records_total metrics to enable operational visibility
- Added unit tests to cover handling of different data types,
CSV loading/handling and metrics tests.

Tested with valgrind - no memory leaks. All unit tests pass.

Signed-off-by: Oleg Mukhin <[email protected]>
- fix variable declarations and remove C99 features
- Conditional compilation for Windows vs Unix headers/functions
- Replace bool with int, fix format specifiers, update comments

All 15 unit tests for filter passed.

Signed-off-by: Oleg Mukhin <[email protected]>
- fix variable declarations and remove C99 features for unit tests
- Conditional compilation for Windows for unit test features

All 15 unit tests for filter passed.

Signed-off-by: Oleg Mukhin <[email protected]>
Addressed the following issues:
Fix potential memory leak when val_node allocation fails
Wrap test metrics code with FLB_HAVE_METRICS guards
Replace metric macros with enum to prevent namespace pollution
Gate plugin registration on FLB_RECORD_ACCESSOR option
Add unmatched quote detection after key parsing in CSV loader
Replace magic numbers with semantic msgpack type checking
Fix thread safety in lookup filter metrics macros
Eliminated potential segfaults from null pointer dereferences
Added defensive checks to the metric creation code
Optimise hot path by eliminating repeated strlen calls

Signed-off-by: Oleg Mukhin <[email protected]>
@olegmukhin (Author)

@patrick-stephens rebased onto the latest commit as advised, built and deployed without issues (related to the plug-in), and also addressed all the new review comments from the AI reviewer in the last commit. Let me know if there are any further changes required - thanks.

I see some potential improvements (such as multiline CSV data support, JSON file support, support for occasionally checking for new data in lookup files, etc.), but I think these should be separate PRs to keep things clean.
